We present a clustering-based language model using word embeddings for text readability prediction. Presumably, a Euclidean semantic space hypothesis holds for word embeddings trained by observing word co-occurrences. We argue that clustering word embeddings in this metric space should yield feature representations in a higher semantic space appropriate for text regression. Moreover, by representing features as histograms, our approach naturally handles documents of varying lengths. An empirical evaluation on the Common Core Standards corpus reveals that the features formed by our clustering-based language model significantly improve on the previously known results for the same corpus in readability prediction. We also evaluate the task of sentence matching based on semantic relatedness using the Wiki-SimpleWiki corpus and find that our features lead to superior matching performance.
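The core idea described above can be sketched as follows: word embeddings are clustered (here with k-means), and each document, whatever its length, is mapped to a fixed-length normalized histogram of its words' cluster assignments. This is an illustrative sketch, not the paper's implementation; the vocabulary, random stand-in embeddings, and cluster count are all assumptions.

```python
# Hypothetical sketch of clustering-based histogram features.
# Random vectors stand in for trained word embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Toy vocabulary with random 50-dimensional "embeddings".
vocab = ["simple", "easy", "complex", "difficult", "cat", "dog", "run", "jump"]
embeddings = {w: rng.normal(size=50) for w in vocab}

# Cluster the embedding space; k is a hyperparameter of the method.
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0)
km.fit(np.stack([embeddings[w] for w in vocab]))
word_to_cluster = {w: c for w, c in zip(vocab, km.labels_)}

def doc_histogram(tokens):
    """Map a document of any length to a fixed-length cluster histogram."""
    hist = np.zeros(k)
    for t in tokens:
        if t in word_to_cluster:  # out-of-vocabulary tokens are skipped
            hist[word_to_cluster[t]] += 1
    total = hist.sum()
    return hist / total if total > 0 else hist

# The resulting fixed-length vector can feed any regressor or matcher.
features = doc_histogram(["the", "cat", "can", "run", "and", "jump"])
```

Because the histogram is normalized, documents of different lengths produce directly comparable feature vectors, which is what makes the representation suitable for regression over whole texts.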